It has long been known that for the comparison of pairwise nested
models, a decision based on the Bayes factor produces a consistent
model selector (in the frequentist sense). Here we go beyond the usual
consistency for nested pairwise models, and show that for a wide 
class of prior distributions, including intrinsic priors, the correspond- 
ing Bayesian procedure for variable selection in normal regression is 
consistent in the entire class of normal linear models. We find that 
the asymptotics of the Bayes factors for intrinsic priors are equiv- 
alent to those of the Schwarz (BIC) criterion. Also, recall that the 
Jeffreys-Lindley paradox refers to the well-known fact that a point 
null hypothesis on the normal mean parameter is always accepted 
when the variance of the conjugate prior goes to infinity. This im- 
plies that some limiting forms of proper prior distributions are not 
necessarily suitable for testing problems. Intrinsic priors are limits 
of proper prior distributions, and for finite sample sizes they have 
been proved to behave extremely well for variable selection in regres- 
sion; a consequence of our results is that for intrinsic priors Lindley's 
paradox does not arise. 

1. Introduction. Bayesian estimation of the parameters of a given sam- 
pling model is, under wide conditions, consistent. That is, the posterior 
probability of the parameter is concentrated around the true value as the 
sample size increases, assuming that the true value belongs to the parame- 
ter space being considered. The case where the dimension of the parameter 
space is infinite can be an exception [see Diaconis and Freedman (1986) for
examples of inconsistency of Bayesian methods].

When several competing models are deemed possible, so that we have 
uncertainty among them, consistency of a Bayesian model selection proce- 
dure is much more involved. For instance, it is well known that improper 
priors for the model parameters cannot be used for computing posterior 
model probabilities. Therefore, the priors need be either proper or limits of 
sequences of proper priors. Furthermore, not every limit of proper priors is 
appropriate for a Bayesian model selection. 

The so-called Lindley paradox is an example of this [Lindley (1957) and 
Jeffreys (1967)]; it shows that when testing a point null hypothesis on the 
normal mean parameter we always accept the null if a conjugate prior is 
considered on the alternative and the variance of this conjugate prior goes 
to infinity. As Robert (1993) has pointed out, this is not a mathematical 
paradox since the prior sequence is giving less and less mass to any neigh- 
borhood of the null point as the prior variance goes to infinity. However, an 
important consequence of the paradox is that some limiting forms of proper 
priors might not be suitable for testing problems as they could provide incon- 
sistency of the corresponding Bayes factors. We remark that intrinsic priors 
are limits of sequences of proper priors [Moreno, Bertolino and Racugno 
(1998)] and for finite sample sizes an intrinsic Bayesian analysis has been 
proved to behave extremely well for variable selection in regression [Casella 
and Moreno (2006), Giron, Moreno and Martinez (2006) and Moreno and 
Giron (2008)]. Consequently, showing that the Lindley paradox does not 
occur when using intrinsic priors is an important point. 

For nested models and proper priors for the model parameters, the con- 
sistency of the Bayesian pairwise model comparison is a well established 
result [see O'Hagan and Forster (2004) and references therein]. Assuming 
that we are sampling from one of the models, say $M_1$, which is nested in
$M_2$, consistency is understood in the sense that the posterior probability of
the true model tends to 1 as the sample size tends to infinity. We observe
that the posterior probability is defined on the space of models $\{M_1, M_2\}$.
An equivalent result is that the Bayes factor $BF_{21} = m_2(\mathbf{X}_n)/m_1(\mathbf{X}_n)$ tends
in probability $[P_1]$ to zero, where $\mathbf{X}_n = (X_1, \ldots, X_n)$.

The extension of this result to the case of a collection of models
$\{P_i : i = 1, 2, \ldots\}$, for which the condition
$\lim_{n\to\infty} m_i(\mathbf{X}_n)/m_1(\mathbf{X}_n) = 0$, $[P_1]$, holds
for any $i \ge 2$, has been established by Dawid (1992). We note that this
condition is satisfied when the model $P_1$ is nested in any other. For nonnested
models the condition does not necessarily hold. As far as we know, a general 
consistency result for the Bayesian model selection procedure for nonnested 
models has not yet been established. This paper is a step forward in this 
direction and proves the consistency of Bayesian model selection procedures 
for normal linear models and a wide class of prior distributions, including
the intrinsic priors. 

For pairwise comparison between nested linear models the consistency 
of the intrinsic Bayesian procedure has already been established [Moreno 
and Giron (2005)]. The present paper is an extension of this result, and we 
prove here consistency of the intrinsic model posterior probabilities in the 
class of all linear models, where many of the models involved are nonnested. 
We also extend this result to a wide class of prior distributions. In proving 
consistency we take advantage of the nice asymptotic behavior of the Bayes 
factors arising from intrinsic priors. It is important to note we are assuming 
that the total number of regressors, k, is fixed and hence does not grow with 
n. For a consistency analysis where k also grows with n, see Shao (1997). 

The rest of the paper is organized as follows. In Section 2 we review 
methods for variable selection based on intrinsic priors and the expressions 
of Bayes factors and posterior model probabilities. In Section 3 we derive the
sampling distribution of the statistic $B^n_{ij}$, the statistic on which the Bayes
factor for comparing two nested models depends, and we also describe its
limiting behavior. This will be the tool we use in Section 4 to find
an asymptotic approximation of the Bayes factor for intrinsic priors, and
to prove consistency of the variable selection procedure. Section 5 provides
an evaluation of the intrinsic Bayes procedure and BIC for small sample 
sizes, and Section 6 contains a concluding discussion. There is also a short 
technical Appendix. 

2. Intrinsic Bayesian procedures for variable selection. Suppose that $Y$
represents an observable random variable and $X_1, X_2, \ldots, X_k$ a set of $k$
potential explanatory covariates related through the normal linear model

$$Y = \alpha_1 X_1 + \alpha_2 X_2 + \cdots + \alpha_k X_k + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2).$$

The variable selection problem consists of reducing the complexity of this
model by identifying a subset of the $\alpha_i$ coefficients that have a zero value
based on an available dataset $(\mathbf{y}, \mathbf{X})$, where $\mathbf{y}$ is a vector of observations of
size $n$ and $\mathbf{X}$ an $n \times k$ design matrix of full rank.

This is by nature a model selection problem where we have to choose a
model among the $2^k$ possible submodels of the above full one. It is common
to set $X_1 = 1$ and $\alpha_1 \neq 0$ to include the intercept in any model. In this
case the number of possible submodels is $2^{k-1}$. The class of models with $i$
regressors will be denoted as $\mathfrak{M}_i$ and hence the class of all possible submodels
can be written as $\mathfrak{M} = \bigcup_i \mathfrak{M}_i$.

2.1. Methods of encompassing. A fully Bayesian objective analysis for 
model comparison in linear regression has been given in Casella and Moreno 
(2006). It consists of considering the pairwise model comparison between
the full model $M_k$ and a generic submodel $M_i$ (see footnote 3) having $i$ ($< k$) nonzero
regression coefficients. Formally, they test the hypothesis

$$H_0 : \text{Model } M_i \quad \text{versus} \quad H_A : \text{Model } M_k. \tag{1}$$

Since $M_i$ is nested in the full model $M_k$, it is possible to derive the intrinsic
priors for the parameters of both models. Then, in the space of models
$\{M_i, M_k\}$ the intrinsic posterior probability of $M_i$ is computed using

$$P(M_i \mid \mathbf{y}, \mathbf{X}) = \frac{m_i(\mathbf{y}, \mathbf{X})}{m_i(\mathbf{y}, \mathbf{X}) + m_k(\mathbf{y}, \mathbf{X})} = \frac{BF_{ik}}{1 + BF_{ik}},$$

where $BF_{ik}$ is the Bayes factor for comparing model $M_i$ to model $M_k$. By
doing this for every model an ordering of the set of models, in accordance
with their posterior probabilities $\{P(M_i \mid \mathbf{y}, \mathbf{X}) = BF_{ik}/(1 + BF_{ik}), M_i \in \mathfrak{M}\}$,
is obtained. The interpretation is that the submodel having the highest
posterior probability is the most plausible reduction in complexity from the
full model, the second highest the second-most plausible reduction and so on.
This intrinsic Bayesian method for variable selection will be called Variable
Selection from Above (VSA).

If we normalize the Bayes factors for intrinsic priors $\{BF_{ik}, i \ge 1\}$, we
obtain a set of probabilities on the class $\mathfrak{M}$ as

$$P(M_i \mid \mathbf{y}, \mathbf{X}) = \frac{BF_{ik}}{1 + \sum_{i' \neq k} BF_{i'k}}, \qquad M_i \in \mathfrak{M}, \tag{2}$$

but we note that these probabilities are not true posterior probabilities of
the models in the class $\mathfrak{M}$, although the ordering of the models they provide
is exactly the same as that given by the above pairwise variable selection
from above.
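The claim that the pairwise probabilities and the normalized probabilities induce the same ordering can be checked with a small sketch. The Bayes factor values and model labels below are hypothetical, chosen only for illustration; the point is that both maps are increasing in $BF_{ik}$.

```python
def pairwise_probability(bf):
    """Posterior probability of a submodel in the two-model space {M_i, M_k}."""
    return bf / (1.0 + bf)


def normalized_probabilities(bayes_factors):
    """Normalize the BF_{ik} over all submodels; the full model contributes BF_{kk} = 1."""
    total = 1.0 + sum(bayes_factors.values())
    return {name: bf / total for name, bf in bayes_factors.items()}


# Hypothetical Bayes factors BF_{ik} of three submodels against the full model.
bayes_factors = {"x1": 0.4, "x1+x2": 3.5, "x1+x3": 1.2}

pairwise = {m: pairwise_probability(bf) for m, bf in bayes_factors.items()}
normalized = normalized_probabilities(bayes_factors)

# Both rankings are driven by BF_{ik} alone, so they coincide.
rank_pairwise = sorted(bayes_factors, key=pairwise.get, reverse=True)
rank_normalized = sorted(bayes_factors, key=normalized.get, reverse=True)
print(rank_pairwise)
print(rank_normalized)
```

The normalized values sum to less than one because the remaining mass, $1/(1 + \sum_{i'} BF_{i'k})$, belongs to the full model.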

However, the manner of encompassing the models is not unique, and a
quite natural alternative to VSA is to consider the pairwise model comparison
between a generic submodel $M_j$ and the model

$$y = \alpha_1 + \varepsilon, \qquad \varepsilon \sim N(\cdot \mid 0, \sigma^2),$$

that contains the intercept only, which is denoted as $M_1$. Formally, this
comparison is made through the hypothesis test

$$H_0 : \text{Model } M_1 \quad \text{versus} \quad H_A : \text{Model } M_j. \tag{3}$$



3 We use $M_i$ to denote any model with $i$ regressors; there are $\binom{k-1}{i-1}$ such models
when the intercept is included in every model. However, the development in the paper
will be clear using this somewhat ambiguous, but simpler, notation.
Notice that $M_1$ is nested in $M_j$, for any $j$, so that the corresponding intrinsic
priors can be derived. In the space of models $\{M_1, M_j\}$ the intrinsic posterior
probability

$$P(M_j \mid \mathbf{y}, \mathbf{X}) = \frac{BF_{j1}}{1 + BF_{j1}}$$

is computed and it gives a new ordering of the models $\{M_j, M_j \in \mathfrak{M}\}$.

Although this alternative procedure is also based on multiple pairwise
comparisons it is easy to see that it is equivalent to ordering the models
according to the intrinsic model posterior probabilities computed in the
space of all models $\mathfrak{M}$ as

$$P(M_j \mid \mathbf{y}, \mathbf{X}) = \frac{BF_{j1}}{1 + \sum_{j' \neq 1} BF_{j'1}}, \qquad M_j \in \mathfrak{M}. \tag{4}$$

This intrinsic Bayesian procedure will be called Variable Selection from Below
(VSB), and has previously been considered by Giron, Moreno and Martinez
(2006).

For finite sample sizes, the orderings of the linear models provided by 
both VSA and VSB intrinsic Bayesian procedures are quite close to each 
other [Moreno and Giron (2008)]. 

2.2. Intrinsic priors and Bayes factors. The intrinsic priors utilized in 
the variable selection methods of Section 2.1 are defined from the comparison 
of two nested linear models, and we now give a general expression of the 
intrinsic priors and the Bayes factor associated with them. 

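Suppose we want to choose between the following two nested linear models

$$M_i : \mathbf{y} = \mathbf{X}_i \alpha_i + \varepsilon_i, \qquad \varepsilon_i \sim N_n(\mathbf{0}, \sigma_i^2 \mathbf{I}_n)$$

and

$$M_j : \mathbf{y} = \mathbf{X}_j \beta_j + \varepsilon_j, \qquad \varepsilon_j \sim N_n(\mathbf{0}, \sigma_j^2 \mathbf{I}_n).$$

We again can do this formally through the hypothesis test

$$H_0 : \text{Model } M_i \quad \text{versus} \quad H_A : \text{Model } M_j, \tag{5}$$

where $M_i$ is nested in $M_j$. Since the models are nested, this implies that the
$n \times i$ design matrix $\mathbf{X}_i$ is a submatrix of the $n \times j$ design matrix $\mathbf{X}_j$, so that
$\mathbf{X}_j = (\mathbf{X}_i \mid \mathbf{Z}_{ij})$. Then, model $M_j$ can be written as

$$M_j : \mathbf{y} = \mathbf{X}_i \beta_i + \mathbf{Z}_{ij} \beta_0 + \varepsilon_j, \qquad \varepsilon_j \sim N_n(\mathbf{0}, \sigma_j^2 \mathbf{I}_n).$$

Comparing model $M_i$ versus $M_j$ is equivalent to testing the hypothesis
$H_0 : \beta_0 = \mathbf{0}$ against $H_1 : \beta_0 \neq \mathbf{0}$. A Bayesian setup for this problem is that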
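of choosing between the Bayesian models

$$\begin{aligned}
M_i &: N_n(\mathbf{y} \mid \mathbf{X}_i \alpha_i, \sigma_i^2 \mathbf{I}_n), \qquad \pi^N(\alpha_i, \sigma_i) = \frac{c_i}{\sigma_i}, \quad \text{and} \\
M_j &: N_n(\mathbf{y} \mid \mathbf{X}_j \beta_j, \sigma_j^2 \mathbf{I}_n), \qquad \pi^N(\beta_j, \sigma_j) = \frac{c_j}{\sigma_j},
\end{aligned} \tag{6}$$

where $\pi^N$ denotes the improper reference prior and $c_i, c_j$ are arbitrary constants
[Berger and Bernardo (1992)].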

The direct use of improper priors for computing model posterior probabilities
is not possible since they depend on the arbitrary constant $c_i/c_j$;
however, they can be converted into suitable intrinsic priors [Berger and
Pericchi (1996)]. Intrinsic priors for the parameters of the above nested linear
models provide a Bayes factor [Moreno, Bertolino and Racugno (1998)]
and, more importantly, posterior probabilities for the models $M_i$ and $M_j$,
assuming that prior probabilities are assigned to them. Here we will use an
objective assessment for this model prior probability, $P(M_i) = P(M_j) = 1/2$.

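Application of the standard intrinsic prior methodology yields the intrinsic
prior distribution for the parameters $\beta_j, \sigma_j$ of model $M_j$, conditional on
a fixed parameter point $\alpha_i, \sigma_i$ of the reduced model $M_i$,

$$\pi^I(\beta_j, \sigma_j \mid \alpha_i, \sigma_i) = \frac{2}{\pi \sigma_i (1 + \sigma_j^2/\sigma_i^2)} \, N_j\bigl(\beta_j \mid \tilde{\alpha}_i, (\sigma_j^2 + \sigma_i^2) \mathbf{W}_j^{-1}\bigr),$$

where $\tilde{\alpha}_i' = (\mathbf{0}', \alpha_i')$ with $\mathbf{0}$ being the null vector of $j - i$ components and
$\mathbf{W}_j^{-1} = \frac{n}{j+1} (\mathbf{X}_j' \mathbf{X}_j)^{-1}$.

The unconditional intrinsic prior for $(\beta_j, \sigma_j)$ is obtained from
$\pi^I(\beta_j, \sigma_j) = \int \pi^I(\beta_j, \sigma_j \mid \alpha_i, \sigma_i) \, \pi^N(\alpha_i, \sigma_i) \, d\alpha_i \, d\sigma_i$,
yielding the intrinsic priors for comparing models $M_i$ and $M_j$ as
$\{\pi^N(\alpha_i, \sigma_i), \pi^I(\beta_j, \sigma_j)\}$. The computation of the
Bayes factor to compare these models using the intrinsic priors is a straightforward
calculation (see the Appendix) and turns out to be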

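$$BF^n_{ij} = \left[ \frac{2}{\pi} (j+1)^{(j-i)/2} \int_0^{\pi/2} \frac{\sin^{j-i}\varphi \, (n + (j+1)\sin^2\varphi)^{(n-j)/2}}{(n B^n_{ij} + (j+1)\sin^2\varphi)^{(n-i)/2}} \, d\varphi \right]^{-1}, \tag{7}$$

where the statistic $B^n_{ij}$ is the ratio of the residual sums of squares

$$B^n_{ij} = \frac{RSS_j}{RSS_i} = \frac{\mathbf{y}'(\mathbf{I} - \mathbf{H}_j)\mathbf{y}}{\mathbf{y}'(\mathbf{I} - \mathbf{H}_i)\mathbf{y}},$$

with $\mathbf{H}_i = \mathbf{X}_i(\mathbf{X}_i'\mathbf{X}_i)^{-1}\mathbf{X}_i'$ the hat matrix of model $M_i$.

Note that as $M_i$ is nested in $M_j$ the values of the statistic $B^n_{ij}$ lie in the
interval $[0, 1]$ and all of the above expressions are valid.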
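As a minimal numerical sketch of expression (7), the function below evaluates $\log BF^n_{ij}$ by a midpoint rule on $(0, \pi/2)$, working in log space so that large $n$ does not overflow. It assumes (7) has the form $BF^n_{ij} = [\tfrac{2}{\pi}(j+1)^{(j-i)/2}\int_0^{\pi/2} \sin^{j-i}\varphi\,(n+(j+1)\sin^2\varphi)^{(n-j)/2}(nB^n_{ij}+(j+1)\sin^2\varphi)^{-(n-i)/2}\,d\varphi]^{-1}$; the function names and the grid size are our own choices, not part of the original development.

```python
import math


def log_intrinsic_bf(n, i, j, b, grid=20000):
    """log BF_ij for M_i nested in M_j, given b = RSS_j / RSS_i in (0, 1]."""
    h = (math.pi / 2) / grid
    logs = []
    for m in range(grid):
        phi = (m + 0.5) * h  # midpoints avoid sin(0) = 0 inside the logarithm
        s2 = math.sin(phi) ** 2
        logs.append(
            0.5 * (j - i) * math.log(s2)
            + 0.5 * (n - j) * math.log(n + (j + 1) * s2)
            - 0.5 * (n - i) * math.log(n * b + (j + 1) * s2)
        )
    top = max(logs)  # log-sum-exp for numerical stability
    log_integral = top + math.log(sum(math.exp(v - top) for v in logs) * h)
    return -(math.log(2 / math.pi) + 0.5 * (j - i) * math.log(j + 1) + log_integral)


def posterior_probability(log_bf):
    """BF / (1 + BF) computed stably from log BF (equal prior model probabilities)."""
    if log_bf >= 0:
        return 1.0 / (1.0 + math.exp(-log_bf))
    e = math.exp(log_bf)
    return e / (1.0 + e)


for b in (1.0, 0.9, 0.5):
    lbf = log_intrinsic_bf(n=50, i=1, j=2, b=b)
    print(b, lbf, posterior_probability(lbf))
```

Values of $B^n_{ij}$ near 1 (the larger model barely reduces the residual sum of squares) give a positive log Bayes factor in favor of $M_i$, while small values of $B^n_{ij}$ favor $M_j$.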
3. Sampling distribution of $B^n_{ij}$. If we denote the true model by $M_T$, so
that the distribution of the vector of observations $\mathbf{y}$ follows $N_n(\mathbf{y} \mid \mathbf{X}_T \alpha_T, \sigma_T^2 \mathbf{I}_n)$,
the sampling distribution of the statistic $B^n_{ij}$ is given in the following theorem.

Theorem 1. If $M_i$ is nested in $M_j$ and $M_T$ is the true model, then the
sampling distribution of $B^n_{ij}$ is the doubly noncentral beta distribution

$$B^n_{ij} \mid M_T \sim Be\left(\frac{n-j}{2}, \frac{j-i}{2}; \lambda_1, \lambda_2\right),$$

where the noncentrality parameters are

$$\lambda_1 = \frac{1}{2\sigma_T^2} \alpha_T' \mathbf{X}_T' (\mathbf{I} - \mathbf{H}_j) \mathbf{X}_T \alpha_T$$

and

$$\lambda_2 = \frac{1}{2\sigma_T^2} \alpha_T' \mathbf{X}_T' (\mathbf{H}_j - \mathbf{H}_i) \mathbf{X}_T \alpha_T.$$

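Proof. The quadratic form in the denominator of $B^n_{ij}$ can be decomposed as

$$\mathbf{y}'(\mathbf{I} - \mathbf{H}_i)\mathbf{y} = \mathbf{y}'(\mathbf{I} - \mathbf{H}_j)\mathbf{y} + \mathbf{y}'(\mathbf{H}_j - \mathbf{H}_i)\mathbf{y}.$$

As the matrices $(\mathbf{I} - \mathbf{H}_j)$ and $(\mathbf{H}_j - \mathbf{H}_i)$ are idempotent of ranks $n - j$ and
$j - i$, respectively, it follows from the generalized Cochran theorem that
the quadratic forms $\mathbf{y}'(\mathbf{I} - \mathbf{H}_j)\mathbf{y}$ and $\mathbf{y}'(\mathbf{H}_j - \mathbf{H}_i)\mathbf{y}$ are independent and,
after division by $\sigma_T^2$, distributed as $\chi'^2(n - j; \lambda_1)$ and $\chi'^2(j - i; \lambda_2)$, respectively. From this the
distribution of the statistic $B^n_{ij}$ follows, and Theorem 1 is proved. □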
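The decomposition used in the proof can be verified numerically in the simplest nested pair, the intercept-only model $M_1$ inside the simple regression $M_2$. The sketch below uses closed-form least squares and a small made-up dataset (the numbers are illustrative only); in this case $\mathbf{y}'(\mathbf{H}_2 - \mathbf{H}_1)\mathbf{y} = S_{xy}^2/S_{xx}$.

```python
def rss_decomposition(x, y):
    """Return (RSS_1, RSS_2, explained) for intercept-only vs. simple regression."""
    n = len(y)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    rss_1 = sum((yi - ybar) ** 2 for yi in y)  # y'(I - H_1)y
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    rss_2 = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))  # y'(I - H_2)y
    explained = sxy ** 2 / sxx  # y'(H_2 - H_1)y
    return rss_1, rss_2, explained


# Hypothetical data, for illustration only.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.1, 1.3, 1.9, 3.2, 3.8, 5.4]

rss_1, rss_2, explained = rss_decomposition(x, y)
b_12 = rss_2 / rss_1  # the statistic B_12, which equals 1 - r^2 in this case
print(rss_1, rss_2 + explained, b_12)
```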

Note that the models $M_i$ and $M_j$ need not be nested in the true model
$M_T$, and the true model is not necessarily nested in $M_i$ or $M_j$. However, the
distribution of $B^n_{ij}$ simplifies whenever $M_i$ or $M_j$ is the true model. Thus we
have the following corollary.

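Corollary 1. (i) If the smaller model $M_i$ is the true one, then

$$B^n_{ij} \mid M_i \sim Be\left(\frac{n-j}{2}, \frac{j-i}{2}\right).$$

(ii) If the larger model $M_j$ is the true one, then

$$B^n_{ij} \mid M_j \sim Be\left(\frac{n-j}{2}, \frac{j-i}{2}; 0, \lambda\right),$$

where

$$\lambda = \frac{1}{2\sigma_j^2} \alpha_j' \mathbf{X}_j' (\mathbf{H}_j - \mathbf{H}_i) \mathbf{X}_j \alpha_j.$$

Proof. Part (i) follows from the fact that $\mathbf{X}_i'\mathbf{H}_i = \mathbf{X}_i'\mathbf{H}_j$, so that both
noncentrality parameters vanish, and part (ii)
from $\mathbf{X}_j'(\mathbf{H}_j - \mathbf{H}_i) = \mathbf{X}_j'(\mathbf{I} - \mathbf{H}_i)$. □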

The limiting value of $B^n_{ij}$ is important because it bears directly on the
evaluation of the consistency of the Bayes factors. That value is given in the 
following theorem. 



Theorem 2. Let $\{X_n, n \ge 1\}$ be a sequence of random variables with
distribution $Be((n - \alpha_0)/2, \beta_0/2; n\delta_1, n\delta_2)$, where $\alpha_0, \beta_0, \delta_1, \delta_2$ are positive
constants. Then:

(i) the sequence $X_n$ converges in probability to the constant

$$\frac{1 + \delta_1}{1 + \delta_1 + \delta_2};$$

(ii) if $\delta_1 = \delta_2 = 0$, then $X_n$ degenerates in probability to 1. However, the
random variable $-(n/2)\log X_n$ does not degenerate and has an asymptotic
Gamma distribution, $Ga(\beta_0/2, 1)$.



Proof. Part (i). By definition $X_n$ is

$$X_n = \left[ 1 + \frac{\chi'^2(\beta_0, n\delta_2)}{\chi'^2(n - \alpha_0, n\delta_1)} \right]^{-1},$$

where $\chi'^2(\beta_0, n\delta_2)$ and $\chi'^2(n - \alpha_0, n\delta_1)$ are independent random variables
with noncentral chi-square distributions. If we divide the numerator and
denominator by $n$ we get

$$X_n = \left[ 1 + \frac{V_n}{W_n} \right]^{-1},$$

where $V_n = \chi'^2(\beta_0, n\delta_2)/n$ and $W_n = \chi'^2(n - \alpha_0, n\delta_1)/n$. Their means and
variances are

$$E(V_n) = \delta_2 + \frac{\beta_0}{n}, \qquad E(W_n) = 1 + \delta_1 - \frac{\alpha_0}{n}$$

and

$$\mathrm{Var}(V_n) = \frac{4\delta_2}{n} + \frac{2\beta_0}{n^2}, \qquad \mathrm{Var}(W_n) = \frac{2(1 + 2\delta_1)}{n} - \frac{2\alpha_0}{n^2}.$$

Since the variances go to zero as $n$ goes to infinity, $X_n$ degenerates in probability
to $(1 + \delta_1)/(1 + \delta_1 + \delta_2)$ as asserted.

The remainder of the proof is straightforward and hence is omitted. □
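A small Monte Carlo sketch can illustrate part (i). It assumes the noncentral chi-square is parametrized so that its mean is the degrees of freedom plus the noncentrality parameter, which is the convention implied by the means displayed in the proof; the constants, seed and helper names below are arbitrary illustrative choices.

```python
import math
import random


def noncentral_chi2(df, noncentrality, rng):
    """Draw from chi'^2(df, noncentrality), with mean df + noncentrality (integer df >= 1)."""
    shift = math.sqrt(noncentrality)
    total = (rng.gauss(0.0, 1.0) + shift) ** 2  # all noncentrality placed on one coordinate
    for _ in range(df - 1):
        total += rng.gauss(0.0, 1.0) ** 2
    return total


def draw_x_n(n, alpha0, beta0, delta1, delta2, rng):
    """One draw of X_n = [1 + V/W]^{-1} = W / (W + V)."""
    w = noncentral_chi2(n - alpha0, n * delta1, rng)
    v = noncentral_chi2(beta0, n * delta2, rng)
    return w / (w + v)


rng = random.Random(0)
n, alpha0, beta0, delta1, delta2 = 4000, 3, 2, 0.5, 1.0
draws = [draw_x_n(n, alpha0, beta0, delta1, delta2, rng) for _ in range(20)]
estimate = sum(draws) / len(draws)
limit = (1 + delta1) / (1 + delta1 + delta2)
print(estimate, limit)
```

With these constants the limit in Theorem 2(i) is $1.5/2.5 = 0.6$, and the simulated values concentrate around it as $n$ grows.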
4. Consistency of the VSA and VSB intrinsic Bayesian procedures. The
steps in proving consistency of the intrinsic Bayesian procedures are:

1. Derive an asymptotic approximation for the Bayes factor for nested 
models given in expression (7). 

2. From this approximation derive another that is valid for any arbi- 
trary pair of models. 

3. Use Theorems 1 and 2 to prove consistency of the VSB procedure. 

It will also be seen that the asymptotic behavior of the Bayes factor for 
VSA is exactly the same as VSB, and hence the consistency of the former 
procedure also holds. 

This is a useful property of the intrinsic methodology for variable selection 
since any way of encompassing the models to derive the intrinsic priors 
produces essentially the same answer for finite sample sizes and for large 
sample sizes. 

4.1. Asymptotic approximation of $BF^n_{ij}$. For large $n$, we can get an approximation
of $BF^n_{ij}$ of (7) that is valid whenever model $M_i$ is nested in $M_j$.
The approximation turns out to be equivalent to the Schwarz (1978) Bayes
factor approximation.

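Theorem 3. When $M_i$ is nested in $M_j$, for large values of $n$ the Bayes
factor given in (7) can be approximated by

$$BF^n_{ij} \approx \frac{\pi}{2} (j+1)^{(i-j)/2} \, I(B^n_{ij})^{-1} \exp\left\{ \frac{j-i}{2} \log n + \frac{n-i}{2} \log B^n_{ij} \right\}, \tag{8}$$

where

$$I(B^n_{ij}) = \int_0^{\pi/2} \sin^{j-i}\varphi \, \exp\left\{ \frac{j+1}{2} \sin^2\varphi \left[ 1 - \frac{1}{B^n_{ij}} \right] \right\} d\varphi
= \frac{\sqrt{\pi}}{2} \, \frac{\Gamma\bigl(\frac{j-i+1}{2}\bigr)}{\Gamma\bigl(\frac{j-i+2}{2}\bigr)} \; {}_1F_1\left( \frac{j-i+1}{2}; \frac{j-i+2}{2}; \frac{j+1}{2} \left[ 1 - \frac{1}{B^n_{ij}} \right] \right)$$

and ${}_1F_1(a; b; z)$ denotes the Kummer confluent hypergeometric function [see
Abramowitz and Stegun (1972), Chapter 13].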



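Proof. We can write the integrand of (7) as

$$\sin^{j-i}\varphi \, \exp\left\{ \frac{n-j}{2} \left[ \log n + \log\left( 1 + \frac{j+1}{n} \sin^2\varphi \right) \right] \right\}
\times \exp\left\{ -\frac{n-i}{2} \left[ \log n + \log B^n_{ij} + \log\left( 1 + \frac{j+1}{n B^n_{ij}} \sin^2\varphi \right) \right] \right\}$$

$$= \sin^{j-i}\varphi \, \exp\left\{ \frac{i-j}{2} \log n + \frac{i-n}{2} \log B^n_{ij} \right\}
\times \frac{\bigl(1 + ((j+1)/n) \sin^2\varphi\bigr)^{(n-j)/2}}{\bigl(1 + ((j+1)/(n B^n_{ij})) \sin^2\varphi\bigr)^{(n-i)/2}}.$$

For large $n$ the numerator of the last factor can be approximated by

$$\left( 1 + \frac{j+1}{n} \sin^2\varphi \right)^{(n-j)/2} \approx \exp\left\{ \frac{j+1}{2} \sin^2\varphi \right\}$$

and the denominator by

$$\left( 1 + \frac{j+1}{n B^n_{ij}} \sin^2\varphi \right)^{(n-i)/2} \approx \exp\left\{ \frac{j+1}{2 B^n_{ij}} \sin^2\varphi \right\}.$$

Therefore, for large $n$ the integrand can be approximated by

$$\sin^{j-i}\varphi \, \exp\left\{ \frac{i-j}{2} \log n + \frac{i-n}{2} \log B^n_{ij} \right\} \exp\left\{ \frac{j+1}{2} \sin^2\varphi \left[ 1 - \frac{1}{B^n_{ij}} \right] \right\}$$

and thus the Bayes factor (7) by

$$BF^n_{ij} \approx \frac{\pi}{2} (j+1)^{(i-j)/2} \, I(B^n_{ij})^{-1} \exp\left\{ \frac{j-i}{2} \log n + \frac{n-i}{2} \log B^n_{ij} \right\},$$

where

$$I(B^n_{ij}) = \int_0^{\pi/2} \sin^{j-i}\varphi \, \exp\left\{ \frac{j+1}{2} \sin^2\varphi \left[ 1 - \frac{1}{B^n_{ij}} \right] \right\} d\varphi.$$

This proves Theorem 3. □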
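The quality of the approximation in Theorem 3 can be examined numerically. The sketch below compares $\log BF^n_{ij}$ from (7) with the logarithm of the right-hand side of (8), both evaluated by a midpoint rule in log space. It assumes the forms of (7) and (8) written in the comments; the function names and the test values of $(n, i, j, B^n_{ij})$ are our own illustrative choices.

```python
import math


def _log_integral(log_f, grid=20000):
    """log of the integral of exp(log_f) over (0, pi/2), midpoint rule with log-sum-exp."""
    h = (math.pi / 2) / grid
    logs = [log_f((m + 0.5) * h) for m in range(grid)]
    top = max(logs)
    return top + math.log(sum(math.exp(v - top) for v in logs) * h)


def log_bf_exact(n, i, j, b):
    # Assumed form of (7): BF = [ (2/pi) (j+1)^((j-i)/2) * integral ]^(-1)
    def log_f(phi):
        s2 = math.sin(phi) ** 2
        return (0.5 * (j - i) * math.log(s2)
                + 0.5 * (n - j) * math.log(n + (j + 1) * s2)
                - 0.5 * (n - i) * math.log(n * b + (j + 1) * s2))
    return -(math.log(2 / math.pi) + 0.5 * (j - i) * math.log(j + 1) + _log_integral(log_f))


def log_bf_approx(n, i, j, b):
    # Assumed form of (8): (pi/2) (j+1)^((i-j)/2) I(b)^(-1) exp{((j-i)/2) log n + ((n-i)/2) log b}
    def log_f(phi):
        s2 = math.sin(phi) ** 2
        return 0.5 * (j - i) * math.log(s2) + 0.5 * (j + 1) * s2 * (1.0 - 1.0 / b)
    return (math.log(math.pi / 2) + 0.5 * (i - j) * math.log(j + 1) - _log_integral(log_f)
            + 0.5 * (j - i) * math.log(n) + 0.5 * (n - i) * math.log(b))


for n in (50, 500, 5000):
    exact = log_bf_exact(n, 1, 3, 0.8)
    approx = log_bf_approx(n, 1, 3, 0.8)
    print(n, exact, approx, exact - approx)
```

The difference between the two log Bayes factors shrinks as $n$ grows, which is the sense in which (8) approximates (7) for fixed $B^n_{ij}$.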



We note that $I(B^n_{ij})$ has a finite value for all values of the statistic $B^n_{ij}$
except when it goes to zero. However, we can see in the proof of Theorem 4
that $B^n_{ij}$ tends to a strictly positive number with probability 1 as $n \to \infty$
[see expression (14)], so $I(B^n_{ij})^{-1}$ is finite for all $n$.

Therefore, $BF^n_{ij}$ can be approximated, up to a multiplicative constant, by
the exponential function in (8). This exponential function turns out to be the
Schwarz approximation $S^n_{ij}$ to the Bayes factor for comparing linear models
[Schwarz (1978)]. Of course, the normal linear models are regular so the
Laplace approximation can be applied to obtain the Schwarz approximation
although for intrinsic priors the ratio $BF^n_{ij}/S^n_{ij}$ does not go to 1 [this holds
only for particular priors; see Kass and Wasserman (1995)].

However, for proving consistency we can ignore terms of constant order
and the Bayes factor for intrinsic priors can be approximated by the Schwarz
approximation

$$BF^n_{ij} \approx S^n_{ij} = \exp\left\{ \frac{j-i}{2} \log n + \frac{n}{2} \log B^n_{ij} \right\}. \tag{9}$$
We note that $S^n_{ij}$ could provide a crude approximation to $BF^n_{ij}$ for small
or moderate sample sizes. In Section 5 we look at small-sample behavior of
both the Schwarz approximation and the Bayes factor for intrinsic priors.
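The link between the Schwarz approximation and BIC can be made concrete. Taking BIC for a normal linear model with $p$ regression coefficients as $n \log(RSS/n) + p \log n$ (additive constants dropped, which is an assumption about the BIC convention rather than something stated above), $\log S^n_{ij}$ is half the BIC difference between $M_j$ and $M_i$. The residual sums of squares below are hypothetical.

```python
import math


def log_schwarz(n, i, j, rss_i, rss_j):
    """log S_ij = ((j - i)/2) log n + (n/2) log(RSS_j / RSS_i)."""
    return 0.5 * (j - i) * math.log(n) + 0.5 * n * math.log(rss_j / rss_i)


def bic(n, p, rss):
    """BIC of a normal linear model with p coefficients, additive constants dropped."""
    return n * math.log(rss / n) + p * math.log(n)


n, i, j = 40, 2, 5
rss_i, rss_j = 61.3, 48.9  # hypothetical residual sums of squares, RSS_j <= RSS_i

lhs = log_schwarz(n, i, j, rss_i, rss_j)
rhs = 0.5 * (bic(n, j, rss_j) - bic(n, i, rss_i))
print(lhs, rhs)
```

A positive value of $\log S^n_{ij}$ means the smaller model $M_i$ has the lower BIC and is favored.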

4.2. Consistency of the VSB intrinsic Bayesian procedure. Given an arbitrary
model $M_j$ and the true model $M_T$ in the class $\mathfrak{M}$, we will assume
the design matrices of the linear models satisfy the following condition (D):
the matrix

$$\mathbf{S}_{jT} = \lim_{n\to\infty} \frac{\mathbf{X}_T'(\mathbf{I} - \mathbf{H}_j)\mathbf{X}_T}{n} \tag{10}$$

is a positive semidefinite matrix. This is not too demanding a condition, as
the following example shows.

Example 1 [Berger and Pericchi (2004)]. Consider the case of testing
whether the slope of a linear regression is zero. Suppose that the true model
$M_T$ is the model with regression coefficients $(\alpha_1, \alpha_2)$, and thus there is only
one alternative model $M_1$, the model with only the intercept term $\alpha_1$. Suppose
that there are $2n + 1$ observations yielding the design matrix

$$\mathbf{X}_T' = \begin{pmatrix} 1 & \cdots & 1 & 1 & \cdots & 1 & 1 \\ 0 & \cdots & 0 & \delta & \cdots & \delta & 1 \end{pmatrix},$$

where $\delta$ is different from zero. Easy calculations show that

$$\mathbf{S}_{1T} = \lim_{n\to\infty} \frac{\mathbf{X}_T'(\mathbf{I} - \mathbf{H}_1)\mathbf{X}_T}{2n + 1} = \begin{pmatrix} 0 & 0 \\ 0 & \delta^2/4 \end{pmatrix},$$

which obviously is a positive semidefinite matrix for any positive $|\delta|$, no
matter how close to zero it is.

Thus, condition (D) is satisfied even when the samples are coming from
a model $M_T$, which is as close to $M_1$ as we want.
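The limit in Example 1 can be checked numerically under one reading of the design: $n$ rows with covariate value 0, $n$ rows with value $\delta$ and one row with value 1, normalized by the number of observations $2n + 1$. Both the design layout and the normalization are assumptions made for this sketch. Since $(\mathbf{I} - \mathbf{H}_1)$ centers each column, the only nonzero entry of $\mathbf{X}_T'(\mathbf{I} - \mathbf{H}_1)\mathbf{X}_T$ is the centered sum of squares of the covariate.

```python
def s_1t_entry(n, delta):
    """(2,2) entry of X_T'(I - H_1)X_T / (2n + 1) for the assumed design of Example 1."""
    x = [0.0] * n + [delta] * n + [1.0]
    total = len(x)  # 2n + 1 observations
    xbar = sum(x) / total
    # (I - H_1) centers the covariate; the intercept column is annihilated.
    return sum((xi - xbar) ** 2 for xi in x) / total


delta = 0.3
for n in (10, 1000, 100000):
    print(n, s_1t_entry(n, delta), delta ** 2 / 4)
```

Under these assumptions the entry approaches $\delta^2/4$, which is strictly positive however small $|\delta|$ is.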

To characterize the asymptotic behavior of the model posterior probabilities,
we can work with $BF^n_{ij}$ of (8), ignoring the positive terms that do not
depend on $n$ as we are only interested in limiting values of 0 or $\infty$.

To test the hypothesis (3) with data $(\mathbf{y}, \mathbf{X})$, we note that the intrinsic
model posterior probability of model $M_j$, defined in the class of all models
$\mathfrak{M}$ given by (4), is an increasing function of $BF_{j1}$, where $BF_{j1}$ denotes the
Bayes factor for intrinsic priors for comparing the nested models $M_1$ versus
$M_j$. Hence, from the asymptotic approximation (8) we can write



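$$P(M_j \mid \mathbf{y}, \mathbf{X}) = \frac{BF_{j1}}{1 + \sum_{j' \neq 1} BF_{j'1}}
\approx \frac{c_{j1} \, I(B^n_{1j})^{-1} \exp\{ -((j-1)/2) \log n - (n/2) \log B^n_{1j} \}}{1 + \sum_{i' \neq 1} c_{i'1} \, I(B^n_{1i'})^{-1} \exp\{ -((i'-1)/2) \log n - (n/2) \log B^n_{1i'} \}}. \tag{11}$$

Similarly, for the true model $M_T$ we can write

$$P(M_T \mid \mathbf{y}, \mathbf{X}) \approx \frac{c_{T1} \, I(B^n_{1T})^{-1} \exp\{ -((T-1)/2) \log n - (n/2) \log B^n_{1T} \}}{1 + \sum_{i' \neq 1} c_{i'1} \, I(B^n_{1i'})^{-1} \exp\{ -((i'-1)/2) \log n - (n/2) \log B^n_{1i'} \}},$$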

where $c_{j1}$ and $c_{T1}$ do not depend on $n$, and $I(B^n_{1j})^{-1}$ and $I(B^n_{1T})^{-1}$ are
finite for all $n$. We are concerned with the limiting behavior of the ratio of
these two probabilities, and specifically whether the limit is 0 or $\infty$. Thus, in the
following we can ignore the finite terms and approximate the ratio with

$$\frac{P(M_j \mid \mathbf{y}, \mathbf{X})}{P(M_T \mid \mathbf{y}, \mathbf{X})} \approx \exp\left\{ \frac{T-j}{2} \log n + \frac{n}{2} \log \frac{B^n_{1T}}{B^n_{1j}} \right\} \tag{12}$$

because the denominators cancel. (As a curiosity, note that this formula
is exact for the case when $M_j = M_T$,
when its value is exactly equal to one.)
We now have the following theorem.

Theorem 4. In the class of linear models $\mathfrak{M}$ with design matrices satisfying
condition (D), the intrinsic Bayesian variable selection procedure VSB
is consistent. That is, when sampling from $M_T$ we have that

$$\frac{P(M_j \mid \mathbf{y}, \mathbf{X})}{P(M_T \mid \mathbf{y}, \mathbf{X})} \to 0, \quad [P_T],$$

whenever the model $M_j \neq M_T$.



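Proof. Assuming $M_T \neq M_1$, from Corollary 1(ii), we have that

$$B^n_{1T} \mid M_T \sim Be\left( \frac{n-T}{2}, \frac{T-1}{2}; 0, \lambda \right),$$

where

$$\lambda = \frac{1}{2\sigma_T^2} \alpha_T' \mathbf{X}_T' (\mathbf{H}_T - \mathbf{H}_1) \mathbf{X}_T \alpha_T,$$

and from Theorem 1 that

$$B^n_{1j} \mid M_T \sim Be\left( \frac{n-j}{2}, \frac{j-1}{2}; \lambda_1, \lambda_2 \right),$$

where the noncentrality parameters are

$$\lambda_1 = \frac{1}{2\sigma_T^2} \alpha_T' \mathbf{X}_T' (\mathbf{I} - \mathbf{H}_j) \mathbf{X}_T \alpha_T, \qquad
\lambda_2 = \frac{1}{2\sigma_T^2} \alpha_T' \mathbf{X}_T' (\mathbf{H}_j - \mathbf{H}_1) \mathbf{X}_T \alpha_T. \tag{13}$$

From Theorem 2(i), we have

$$B^n_{1T} \mid M_T \to \left[ 1 + \frac{1}{2\sigma_T^2} \alpha_T' \mathbf{S}_{1T} \alpha_T \right]^{-1}
\quad \text{and} \quad
B^n_{1j} \mid M_T \to \frac{1 + \frac{1}{2\sigma_T^2} \alpha_T' \mathbf{S}_{jT} \alpha_T}{1 + \frac{1}{2\sigma_T^2} \alpha_T' \mathbf{S}_{1T} \alpha_T}, \tag{14}$$

so that

$$\frac{B^n_{1T}}{B^n_{1j}} \,\Big|\, M_T \to \left[ 1 + \frac{1}{2\sigma_T^2} \alpha_T' \mathbf{S}_{jT} \alpha_T \right]^{-1} < 1.$$

Therefore, the expression

$$\frac{n}{2} \log \frac{B^n_{1T}}{B^n_{1j}}$$

goes to $-\infty$ with order $O(n)$. This means that expression (12) converges to
zero regardless of whether $T - j$ is positive or negative.

When $M_T = M_1$, then for any $j > 1$ we have

$$P(M_j \mid \mathbf{y}, \mathbf{X}) \propto BF^n_{j1} \approx \exp\left\{ -\frac{j-1}{2} \log n - \frac{n}{2} \log B^n_{1j} \right\}.$$

From Corollary 1(i) and Theorem 2(ii) it follows that $-(n/2)\log B^n_{1j}$ is asymptotically
distributed as a Gamma distribution. Therefore, for any $j > 1$,
$P(M_j \mid \mathbf{y}, \mathbf{X})$ tends, in probability, to zero. The proof is complete. □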
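To see the two mechanisms at work in expression (12), the sketch below evaluates its exponent with the ratio $B^n_{1T}/B^n_{1j}$ replaced by a fixed limiting value. The limiting values used are hypothetical, and treating a model that contains the true one as having limiting ratio exactly 1 (so that only the $\log n$ term remains and the bounded-in-probability fluctuation is ignored) is our simplification for illustration, not a statement from the proof.

```python
import math


def log_posterior_ratio(n, t, j, limit_ratio):
    """Exponent of (12): ((T - j)/2) log n + (n/2) log(limit of B_1T / B_1j)."""
    return 0.5 * (t - j) * math.log(n) + 0.5 * n * math.log(limit_ratio)


for n in (10 ** 2, 10 ** 3, 10 ** 4):
    # Model missing a relevant regressor: ratio limit < 1, the O(n) term dominates.
    underfit = log_posterior_ratio(n, t=3, j=2, limit_ratio=0.9)
    # Model with superfluous regressors: ratio limit 1, only the log n penalty acts.
    overfit = log_posterior_ratio(n, t=3, j=5, limit_ratio=1.0)
    print(n, underfit, overfit)
```

In the first case the exponent decreases linearly in $n$; in the second it decreases like $-\log n$, which is slower but still diverges to $-\infty$.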

4.3. Consistency of the VSA intrinsic Bayesian procedure. In the VSA
intrinsic Bayesian procedure we use the fact that every model $M_j$ is nested
in the full model $M_k$. Then, for large values of $n$ the posterior probability
of model $M_j$ in the space of models $\{M_j, M_k\}$ is proportional to

$$P(M_j \mid \mathbf{y}, \mathbf{X}) \propto BF^n_{jk} \approx \exp\left\{ \frac{k-j}{2} \log n + \frac{n}{2} \log B^n_{jk} \right\}.$$

Similarly, for the true model $M_T$ we have

$$P(M_T \mid \mathbf{y}, \mathbf{X}) \propto BF^n_{Tk} \approx \exp\left\{ \frac{k-T}{2} \log n + \frac{n}{2} \log B^n_{Tk} \right\}.$$

Thus, the ratio of Bayes factors can be approximated by

$$\frac{P(M_j \mid \mathbf{y}, \mathbf{X})}{P(M_T \mid \mathbf{y}, \mathbf{X})} \propto \frac{BF^n_{jk}}{BF^n_{Tk}} \approx \exp\left\{ \frac{T-j}{2} \log n + \frac{n}{2} \log \frac{B^n_{1T}}{B^n_{1j}} \right\},$$

where we have used $B^n_{jk}/B^n_{Tk} = RSS_T/RSS_j = B^n_{1T}/B^n_{1j}$. The last expression
is exactly that given in (12) so that it tends to
zero for any $j > 1$. We thus have the following corollary to Theorem 4.

Corollary 2. In the class of linear models $\mathfrak{M}$ with design matrices
satisfying condition (D), the intrinsic Bayesian variable selection procedure
VSA is consistent. That is, when sampling from $M_T$ we have that

$$\frac{P(M_j \mid \mathbf{y}, \mathbf{X})}{P(M_T \mid \mathbf{y}, \mathbf{X})} \to 0, \quad [P_T],$$

whenever the model $M_j \neq M_T$.



Recall that in Section 2.1 we noted that for VSA, the probabilities

$$P(M_i \mid \mathbf{y}, \mathbf{X}) = \frac{BF_{ik}}{1 + \sum_{i' \neq k} BF_{i'k}}, \qquad M_i \in \mathfrak{M},$$

were not true posterior probabilities of the models in the class $\mathfrak{M}$. However,
from Corollary 2, this set of probabilities [utilized as a tool for variable selection
in Casella and Moreno (2006)], is a consistent sequence of probabilities.
Further, we recall that the ordering of the models they provide is exactly the
same as that given by the VSA pairwise variable selection. Therefore, the
intrinsic model posterior probabilities from above form a set of consistent
probabilities in the class of all linear models $\mathfrak{M}$.

4.4. Extensions. The consistency of the intrinsic Bayesian variable se- 
lection procedure for the class of linear models can be extended to any other 
Bayesian procedure for a wide class of prior distributions. We observe that 
all we have used to prove consistency of the intrinsic Bayesian procedures is 
the Schwarz approximation, and the distribution of the ratio of the residuals 
of two nested linear models when sampling from a linear model that does 
not necessarily coincide with either of the two. Therefore, for any prior for
which the Schwarz approximation for linear models is valid, the consistency 
of the associated Bayesian procedure can be asserted. Hence, we can prove 
the following theorem. 

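Theorem 5. In the class of linear models $\mathfrak{M}$ with design matrices satisfying
condition (D), assume that the priors $\pi_i, \pi_j$, for any $i, j$, are such
that

$$0 < \lim_{n\to\infty} \frac{\pi_i(\hat{\alpha}_i, \hat{\sigma}_i)}{\pi_j(\hat{\alpha}_j, \hat{\sigma}_j)} < \infty, \quad [P_T],$$

where $(\hat{\alpha}_i, \hat{\sigma}_i)$ and $(\hat{\alpha}_j, \hat{\sigma}_j)$ are the respective MLEs. Then the Bayesian variable
selection procedure is consistent, that is, when sampling from $M_T \in \mathfrak{M}$, we
have that

$$\frac{P(M_j \mid \mathbf{y}, \mathbf{X})}{P(M_T \mid \mathbf{y}, \mathbf{X})} \to 0, \quad [P_T],$$

whenever the model $M_j \neq M_T$.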

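We note that priors of the form $\pi^N_i(\alpha_i, \sigma_i) = c_i/\sigma_i^q$, where $q$ is a positive
number, which includes the reference priors for $q = 1$ and the Jeffreys priors
for $q = i$, satisfy the condition required in Theorem 5. Indeed, from (14), it
follows that

$$\lim_{n\to\infty} \frac{\pi^N_i(\hat{\alpha}_i, \hat{\sigma}_i)}{\pi^N_j(\hat{\alpha}_j, \hat{\sigma}_j)}
= \frac{c_i}{c_j} \lim_{n\to\infty} \left( \frac{\hat{\sigma}_j^2}{\hat{\sigma}_i^2} \right)^{q/2}
= \frac{c_i}{c_j} \left( \frac{2\sigma_T^2 + \alpha_T' \mathbf{S}_{jT} \alpha_T}{2\sigma_T^2 + \alpha_T' \mathbf{S}_{iT} \alpha_T} \right)^{q/2}, \quad [P_T],$$

which clearly is a real positive quantity.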

Hence, even though for finite sample sizes the above priors only provide 
Bayes factors defined up to a multiplicative constant, asymptotically they 
behave consistently. 

5. Small sample comparisons. Although for large sample sizes the vari- 
able selection procedure based on the Bayes factor for intrinsic priors is 
equivalent to that based on the Schwarz approximation, an open question 
is how well the Schwarz asymptotic approximation and the Bayes factor
for intrinsic priors behave for small or moderate sample sizes. To answer
this question we recall that, in the case of encompassing from below, the
ordering of the models provided by the pairwise intrinsic model posterior
probabilities

$$P(M_j \mid \mathbf{y}, \mathbf{X}) = \frac{BF^n_{j1}}{1 + BF^n_{j1}} \quad \text{for } j \ge 2$$

is exactly the same as that provided by the intrinsic model posterior probabilities
in the whole space $\mathfrak{M}$.

Therefore, for comparing the intrinsic Bayes factor $BF^n_{ij}$ and the Schwarz
approximation $S^n_{ij}$ for any $i$ and $j$ it is enough to compare $BF^n_{j1}$ and $S^n_{j1} = 1/S^n_{1j}$ for
$j \ge 2$. It seems appropriate to compare $BF^n_{j1}$ and $S^n_{j1}$ in a probabilistic scale,
that is, to compare the intrinsic posterior model probability $P(M_j \mid \mathbf{y}, \mathbf{X})$ and
the Schwarz approximation posterior probability

$$P^S(M_j \mid \mathbf{y}, \mathbf{X}) = \frac{S^n_{j1}}{1 + S^n_{j1}} \quad \text{for } j \ge 2.$$
Table 1

Type I error probabilities for the intrinsic procedure and the Schwarz approximation. In
each cell, the left probability is the Type I error of the intrinsic procedure and the right
probability is the Type I error of the Schwarz approximation

          n = 7         n = 10        n = 15          n = 40         n = 80
j = 2     0.16, 0.26    0.13, 0.19    0.10, 0.130     0.06, 0.06     0.04, 0.04
j = 3     0.19, 0.33    0.14, 0.20    0.099, 0.114    0.04, 0.03     0.02, 0.02
j = 4     0.23, 0.42    0.16, 0.22    0.104, 0.102    0.03, 0.02     0.02, 0.02
j = 5     0.29, 0.55    0.18, 0.25    0.111, 0.097    0.03, 0.01     0.01, 0.002
j = 6     0.40, 0.75    0.21, 0.31    0.121, 0.097    0.03, 0.006    0.01, 0.001
j = 9     -             0.41, 0.71    0.17, 0.15      0.03, 0.002    0.007, ~0
j = 12    -             -             0.26, 0.36      0.04, 0.001    ~0, ~0
j = 38    -             -             -               0.32, 0.46     0.001, ~0
j = 78    -             -             -               -              0.32, 0.44




A model selection procedure operates by choosing the model with the 
highest value of the criterion, so in our case this is equivalent to accepting 
model $M_j$, and hence rejecting $M_1$, when the posterior probability of $M_j$
is greater than 1/2. It is important to realize that this is not the way a
classical frequentist hypothesis test is set up. In the classical case a test is
calibrated to a specified Type I error $\alpha$, and then the power is examined.
The model selector is defined by the decision rule, and for the given rule 
we can examine the resulting Type I and II errors to assess how the model 
selector is controlling them. 

We recall that both the intrinsic posterior probability $P(M_j \mid \mathbf{y}, \mathbf{X})$ and
the Schwarz approximation $P^S(M_j \mid \mathbf{y}, \mathbf{X})$ depend on the sample observations
$(\mathbf{y}, \mathbf{X})$ through the statistic $B^n_{1j}$. Therefore, any point in the regions

$$R_{1j}(n) = \{ B^n_{1j} : P(M_j \mid B^n_{1j}) > 1/2 \} \quad \text{and} \quad
R^S_{1j}(n) = \{ B^n_{1j} : P^S(M_j \mid B^n_{1j}) > 1/2 \}$$

contains empirical evidence in favor of $M_j$ under the intrinsic Bayesian procedure
and the Schwarz approximation, respectively.

Since $M_1$ is nested in $M_j$ for any $j \ge 2$, it follows that $R_{1j}(n) \subset (0, 1)$
and $R^S_{1j}(n) \subset (0, 1)$. Furthermore, $R_{1j}(n)$ and $R^S_{1j}(n)$ are intervals since
$P(M_j \mid B^n_{1j})$ and $P^S(M_j \mid B^n_{1j})$ are monotone decreasing functions of $B^n_{1j}$.

The distribution of $B^n_{1j}$ is easily computed (see Corollary 1), and we can
examine the Type I errors of the intrinsic Bayesian variable selection procedure
and the Schwarz approximation, respectively. For a range of values
of $j$ and sample sizes $n > j$, Table 1 presents the Type I error probabilities
under the intrinsic Bayesian procedure and the Schwarz approximation.
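The Schwarz entries of Table 1 can be recomputed with a short sketch. Under $M_1$, Corollary 1(i) gives $B^n_{1j} \sim Be((n-j)/2, (j-1)/2)$, and $P^S(M_j \mid B^n_{1j}) > 1/2$ is equivalent to $S^n_{j1} > 1$, that is, to $B^n_{1j} < n^{-(j-1)/n}$. The code below evaluates this beta probability with a standard continued-fraction expansion of the regularized incomplete beta function; the helper names are ours, and reading the decision rule as $S^n_{j1} > 1$ is an assumption of the sketch.

```python
import math


def _betacf(a, b, x, max_iter=500, eps=1e-12):
    """Continued fraction for the incomplete beta function (modified Lentz method)."""
    tiny = 1e-300
    c = 1.0
    d = 1.0 - (a + b) * x / (a + 1.0)
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        aa = m * (b - m) * x / ((a - 1.0 + m2) * (a + m2))  # even step
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        h *= d * c
        aa = -(a + m) * (a + b + m) * x / ((a + m2) * (a + 1.0 + m2))  # odd step
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def beta_cdf(x, a, b):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(x) + b * math.log1p(-x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _betacf(a, b, x) / a
    return 1.0 - front * _betacf(b, a, 1.0 - x) / b


def schwarz_type1_error(n, j):
    """P(B_1j < n^(-(j-1)/n)) when B_1j ~ Be((n-j)/2, (j-1)/2), i.e. under M_1."""
    threshold = n ** (-(j - 1) / n)
    return beta_cdf(threshold, (n - j) / 2.0, (j - 1) / 2.0)


for n, j in [(7, 2), (10, 2), (7, 6)]:
    print(n, j, round(schwarz_type1_error(n, j), 3))
```

The intrinsic entries would require, in addition, solving $BF^n_{j1} = 1$ numerically for the threshold on $B^n_{1j}$, which is not attempted here.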

We see in Table 1 that for small sample sizes the Schwarz approximation
has a very high Type I error (as high as 75%), which soon becomes very
small as $n$ increases. Thus, the Schwarz approximation will be biased away
from the null model for small $n$, or more generally, in the cases where $j$ is
close to $n$. As $n$ increases the Type I error goes rapidly to 0, and the Schwarz
approximation will then be biased toward the null model. In contrast, the
intrinsic procedure has a less variable Type I error, being smaller than that
of the Schwarz approximation for small $n$ and somewhat larger for large $n$.

Fig. 1. For $j = 5$ and $n = 6, \ldots, 40$, Type I errors and power curves of the intrinsic
procedure (solid) and Schwarz approximation (dashed) as a function of $n$. The power
curves are computed for noncentrality parameter $\lambda = 10$.

Examination of Figure 1 tells an interesting story. There, we plotted
Type I errors and power as a function of $n$ for $j = 5$, which was chosen as a
representative case. Note that the decrease in the power, as a function of $n$,
reflects the fact that the Type I error decreases as a function of $n$.

For small $n$ the Schwarz approximation has higher power resulting from
its large Type I error, while the intrinsic procedure tends to moderate both
errors. As $n$ increases, both Type I errors decrease, with the more dramatic
decrease being that of the Schwarz approximation. The Type I errors cross
at $n = 13$, and for $n > 13$ the intrinsic procedure has higher power, reaching
0.573 at $n = 40$ versus 0.385 for the Schwarz approximation. The interesting
point is that, although the intrinsic procedure has higher Type I error, both
Type I errors are very small (e.g., at $n = 29$ they are 0.05 and 0.02). However,
the effect of the Schwarz approximation, by driving the Type I error so close
to zero, is a dramatic decrease in power. Thus, the intrinsic procedure does a
much better job of controlling the errors. By moderating the Type I error
it avoids the faults of the Schwarz approximation, which has very large
Type I error for small $n$, and for large $n$ decreases the Type I error to an
unnecessarily low value to the detriment of its power.

6. Discussion. It has long been known that when choosing between two
models, one of which is true, selecting according to Bayes factors provides
a consistent decision function in the sense that the frequentist probability
of selecting the true model approaches 1 as $n \to \infty$. In this paper,
for the case of variable selection, we have extended this result to selection
among an entire class of linear models and a wide class of priors, and shown
that selecting according to Bayes factors yields a decision rule with the
property that the frequentist probability of selecting the true model
approaches 1 as $n \to \infty$, and the frequentist probability of selecting any
other model approaches 0 as $n \to \infty$.

We have, specifically, worked with intrinsic priors, although our results 
hold for a wide class of priors. However, intrinsic priors provide a type of 
objective Bayesian prior for the testing problem. They seem to be among 
the most diffuse priors that are possible to use in testing, without encoun- 
tering problems with indeterminate Bayes factors, which was the original 
impetus for the development of Berger and Pericchi (1996). Moreover, they 
do not suffer from "Lindley paradox" behavior. Thus, we believe they are a 
very reasonable choice for experimenters looking for an objective Bayesian 
analysis with a frequentist guarantee. This is very much in the spirit of the 
calibrated Bayesian, as described by Little (2006). 

Intrinsic priors have been used successfully in both variable selection 
and changepoint problems [Casella and Moreno (2006), Giron, Moreno and 
Martinez (2006), Giron, Moreno and Casella (2007)], where excellent small- 
sample properties were exhibited. Some other properties of the variable se- 
lection rules considered here are as follows: 

1. All models $M_j$ that contain model $M_T$, and hence have $A_i = 0$ [see (13)],
will have the same value of $\mathcal{B}^n_{1j} \mid M_T$ in (14). This means that the posterior
probability (11) of models $M_j$ that contain model $M_T$ is decreasing in $j$, and
models with larger $j$ will have smaller probabilities. Thus, VSB will tend to
select smaller models. The same holds for VSA.

2. To gain further insight into the large-sample approximation of the
Bayes factors for comparing arbitrary models, say $M_j$ and $M_{j'}$, we look
more closely at some geometric considerations in the space of all models,
such as the role played by a distance that we can define between a
generic model $M_j$ and the true, though unknown, model $M_T$.

If we define this distance as

$$\delta(M_j, M_T) = \alpha_T' S_{jT}\, \alpha_T,$$

we note that it is equal to 0 if either $M_j = M_T$ or $M_T$ is nested in $M_j$;
otherwise, it is strictly positive by condition (D). Also, if model $M_j$ is nested in
$M_i$, then $\delta(M_i, M_T) \leq \delta(M_j, M_T)$, because $H_i - H_j$ is positive semidefinite.
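
3. From (11) we have that

$$P(M_j \mid y, X) \approx \exp\left\{ -\frac{j}{2}\log n - \frac{n}{2}\log \mathcal{B}^n_{1j} \right\},$$

and from (14)

$$\log \frac{\mathcal{B}^n_{1j}}{\mathcal{B}^n_{1j'}} \,\Big|\, M_T \to \log \frac{1 + \delta(M_j, M_T)/2}{1 + \delta(M_{j'}, M_T)/2}.$$

Hence,

$$\frac{P(M_j \mid y, X)}{P(M_{j'} \mid y, X)} \,\Big|\, M_T \approx \exp\left\{ \frac{j' - j}{2}\log n - \frac{n}{2}\log \frac{1 + \delta(M_j, M_T)/2}{1 + \delta(M_{j'}, M_T)/2} \right\},$$

and it follows that

$$\frac{P(M_j \mid y, X)}{P(M_{j'} \mid y, X)} \,\Big|\, M_T \to
\begin{cases}
0, & \text{if } \delta(M_{j'}, M_T) < \delta(M_j, M_T), \\
\infty, & \text{if } \delta(M_{j'}, M_T) > \delta(M_j, M_T).
\end{cases}$$

Thus, the model that is closer to $M_T$ is always preferred.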

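4. If the distance from both models to the true one is the same, that
is, $\delta(M_{j'}, M_T) = \delta(M_j, M_T)$, then the limiting behavior of the quotient of
posterior model probabilities depends only on the number of covariates of
the models: the distance term in the approximation of property 3 vanishes
and the quotient behaves like $n^{(j'-j)/2}$. We have that

$$\frac{P(M_j \mid y, X)}{P(M_{j'} \mid y, X)} \,\Big|\, M_T \to
\begin{cases}
0, & \text{if } \delta(M_{j'}, M_T) = \delta(M_j, M_T) \text{ and } j' < j, \\
1, & \text{if } \delta(M_{j'}, M_T) = \delta(M_j, M_T) \text{ and } j' = j, \\
\infty, & \text{if } \delta(M_{j'}, M_T) = \delta(M_j, M_T) \text{ and } j' > j.
\end{cases}
\qquad (15)$$

When the true model is nested in both $M_j$ and $M_{j'}$, so that
$\delta(M_{j'}, M_T) = \delta(M_j, M_T)$, (15) says that the smaller model is preferred.
Thus, the intrinsic Bayes procedure naturally leans toward a more
parsimonious solution.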

5. We also address the important point of what happens when the true
model is a linear model but does not belong to $\mathfrak{M}$. This happens when, for
example, the true model includes some covariates, or interactions among the
existing or new ones, not previously considered. From the preceding
discussion it follows easily that the preference among the models in
$\mathfrak{M}$ depends solely on their distances to the true model, regardless of
whether the latter does or does not belong to the set of models we are
considering.
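
The large-sample comparisons in properties 3 and 4 can be explored numerically. The Python sketch below simply evaluates the approximate log quotient of posterior model probabilities, $\frac{j'-j}{2}\log n - \frac{n}{2}\log\frac{1+\delta(M_j,M_T)/2}{1+\delta(M_{j'},M_T)/2}$, taking the limiting expression of property 3 as printed. The function name and the numerical values of the distances are hypothetical and chosen only for illustration.

```python
import math


def log_posterior_odds(n, j, j_prime, delta_j, delta_j_prime):
    """Approximate log of P(M_j | y, X) / P(M_j' | y, X) under M_T for large n.

    delta_j and delta_j_prime stand for the distances of M_j and M_j' to the
    true model (illustrative inputs, not estimated from data).
    """
    dimension_term = 0.5 * (j_prime - j) * math.log(n)
    distance_term = -0.5 * n * math.log(
        (1.0 + delta_j / 2.0) / (1.0 + delta_j_prime / 2.0)
    )
    return dimension_term + distance_term


# M_j is larger (j = 8 versus j' = 3) but closer to the true model:
# the distance term dominates and the log odds grow with n.
closer_wins = [log_posterior_odds(n, 8, 3, 0.1, 0.5) for n in (100, 1000, 10000)]

# Equal distances: only the dimensions matter and the smaller model wins,
# so the log odds of the larger model M_j decrease with n.
smaller_wins = [log_posterior_odds(n, 8, 3, 0.2, 0.2) for n in (100, 1000, 10000)]
```

In the first list the values increase with $n$, so the closer model is preferred even though it has more covariates; in the second they decrease like $-\frac{5}{2}\log n$, which is the parsimony effect described in (15).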

Lastly, we note that implementation of the model selection procedure
is best done with a stochastic search algorithm. As there are $2^{k-1}$ possible
models, enumeration quickly becomes infeasible. We have implemented
Metropolis-Hastings driven stochastic searches for both variable selection
[Casella and Moreno (2006)] and changepoint problems [Giron, Moreno and
Casella (2007)] with good results.
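
To make the idea of a Metropolis-Hastings driven search concrete, here is a minimal Python sketch. It is not the implementation used in the cited papers: the single-covariate flip proposal, the function names and the toy scoring function are all assumptions made for illustration. In practice `log_score` would return the log of an unnormalized posterior model probability, for instance one built from the intrinsic Bayes factors.

```python
import math
import random


def stochastic_search(log_score, k, n_iter=5000, seed=0):
    """Metropolis-Hastings search over models with k - 1 optional covariates.

    A model is a tuple of k - 1 booleans (the intercept is assumed to be
    always included). log_score(model) is a user-supplied, hypothetical
    function returning an unnormalized log posterior model probability.
    Returns a dictionary with the number of iterations spent in each model.
    """
    rng = random.Random(seed)
    current = tuple([False] * (k - 1))  # start at the null model
    current_score = log_score(current)
    visits = {}
    for _ in range(n_iter):
        # Symmetric proposal: flip the inclusion indicator of one covariate.
        idx = rng.randrange(k - 1)
        proposal = current[:idx] + (not current[idx],) + current[idx + 1:]
        proposal_score = log_score(proposal)
        # Accept with probability min(1, posterior ratio).
        log_u = math.log(1.0 - rng.random())  # 1 - U lies in (0, 1]
        if log_u < proposal_score - current_score:
            current, current_score = proposal, proposal_score
        visits[current] = visits.get(current, 0) + 1
    return visits


# Toy example: the score is concentrated on a known target model.
target = (True, False, True, False, False)


def toy_log_score(model):
    # Penalize each covariate whose inclusion differs from the target.
    return -5.0 * sum(a != b for a, b in zip(model, target))


visits = stochastic_search(toy_log_score, k=6)
best = max(visits, key=visits.get)
```

The visit frequencies estimate the posterior model probabilities, so the most visited model serves as the selected one; only models actually reached by the chain need to be scored, which is what makes the search feasible when $2^{k-1}$ is large.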

APPENDIX: DERIVATION OF THE INTRINSIC BAYES FACTOR

Here we outline the calculations to justify the intrinsic Bayes factor of (7).
For comparing the models in (6) with

$$\pi^I(\beta_j, \sigma_j) = \int \pi^I(\beta_j, \sigma_j \mid \alpha_i, \sigma_i)\, \pi^N(\alpha_i, \sigma_i)\, d\alpha_i\, d\sigma_i$$

and 

the Bayes factor is given by (7). 

The derivation of this expression is similar to that in Casella and Moreno
(2006), but there different default priors were used and a generic $W_j$ was
derived. Here, we are using the reference prior $\pi^N(\eta, \sigma) = c/\sigma$ instead, which
seems to be a better choice as discussed in Giron et al. (2006), and thus we
obtain a slightly different Bayes factor given by

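$$\frac{2}{\pi}\, \cdots \int_0^{\pi/2} \frac{d\varphi}{|A(\varphi)|^{1/2}\, |B(\varphi)|^{1/2}\, E(\varphi)^{(n-i)/2}},$$

where

$$B(\varphi) = \sin^2\varphi\, I_n + X_j W_j^{-1} X_j',$$

$$A(\varphi) = X_i' B(\varphi)^{-1} X_i$$

and

$$E(\varphi) = y'\left( B(\varphi)^{-1} - B(\varphi)^{-1} X_i A(\varphi)^{-1} X_i' B(\varphi)^{-1} \right) y.$$
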
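Now, taking

$$W_j^{-1} = \frac{n}{j+1}\,(X_j' X_j)^{-1},$$

we have, after some algebra, the following equalities:

(i)

$$B(\varphi)^{-1} = \frac{1}{\sin^2\varphi}\left( I_n - \frac{n}{n + (j+1)\sin^2\varphi}\, H_j \right),$$

(ii)

$$A(\varphi) = \frac{j+1}{n + (j+1)\sin^2\varphi}\, X_i' X_i,$$

(iii)

$$A(\varphi)^{-1} = \frac{n + (j+1)\sin^2\varphi}{j+1}\, (X_i' X_i)^{-1},$$

(iv)

$$E(\varphi) = \frac{j+1}{n + (j+1)\sin^2\varphi}\left( \frac{n}{(j+1)\sin^2\varphi}\, \mathrm{RSS}_j + \mathrm{RSS}_i \right),$$

(v)

$$|A(\varphi)| = \left( \frac{j+1}{n + (j+1)\sin^2\varphi} \right)^{i} |X_i' X_i|,$$

(vi)

$$|B(\varphi)| = (\sin^2\varphi)^{n-j} \left( \frac{n + (j+1)\sin^2\varphi}{j+1} \right)^{j}.$$

Plugging these values into the integral above and making some simplifications
we get expression (7).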